Scalable Fault Tolerance in Multiprocessor Systems

Authors

  • Gagan Gupta
  • Gurindar S. Sohi
Abstract

Evolving trends in design and use of computers are resulting in fault-prone systems which may not execute a program to completion. Checkpoint-and-recovery is commonly used to recover from faults and complete parallel programs. Conventional checkpointing-and-recovery can incur high overheads and may be inadequate in the future as faults become frequent. We propose to execute parallel programs deterministically to tolerate faults at lower overheads and scalably.

Similar resources

An Event-driven Approach to Multiprocessor Diagnosis

For constructing fault tolerance mechanisms in large massively parallel multiprocessor systems, a scalable fault diagnosis is necessary, which works efficiently even if there are several thousand processors in the system. In this paper we present an event-driven, distributed system-level diagnosis algorithm, based on a general diagnosis model which does not limit the number of simultaneously ex...

Fault Tolerance for Multiprocessor Systems Via Time Redundant Task Scheduling

Fault tolerance is often considered as a good additional feature for multiprocessor systems but nowadays it is becoming an essential attribute. Fault tolerance can be achieved by the use of dedicated customized hardware that may have the disadvantage of large cost. Another approach to fault tolerance is to exploit existing redundancy in multiprocessor systems via a task scheduling software stra...

Adaptable Fault Tolerance Configurations for Multiprocessor Systems

The escalating increase in the complexity of multiprocessor systems increases the probability of faults occurring in these systems. As a consequence there is a great need for achieving fault tolerance of processing in multiprocessor systems. Fault tolerance generally requires some forms of hardware and/or time redundancy. Two fault tolerant configurations are proposed for both single and double t...

Analysis of Selective Fault-Tolerant, Hard Real-Time

An increasing number of applications are demanding real-time performance from their multiprocessor systems. For many of these applications, a failure may produce disastrous results. Such failures are avoided in hard real-time systems by the use of fault-tolerance. In hard real-time multiprocessor scheduling, this fault tolerance may be provided by including several task backups in each schedule...

Architecture and Realization of the Modular Expandable Multiprocessor System MEMSY

The experimental multiprocessor system MEMSY2 will be described. This system was built to validate the concept of a scalable multiprocessor architecture based on local shared-memory. Main application areas are scientific computations with high demand for processing power and memory capacity. In designing the hardware architecture the extensive use of standard components and fault toler...


Publication date: 2014